Killer-Skills

Benchmark Manager

v1.0.0
GitHub

About this Skill

Benchmark Manager is a skill that manages AILANG evaluation benchmarks, with features like prompt integration, debugging workflows, and best practices for AI model evaluation. It is ideal for AI agents like Claude, AutoGPT, and LangChain that need efficient AILANG evaluation benchmark management and debugging.

Features

Manages AILANG evaluation benchmarks with correct prompt integration
Provides debugging workflows using scripts like show_full_prompt.sh
Supports testing benchmarks with specific models like claude-haiku-4-5
Checks benchmark YAML for common issues
Integrates with ailang eval-suite for model evaluation
Offers best practices learned from real benchmark failures

Core Topics

sunholo-data
Updated: 3/6/2026

Quality Score

48 (Excellent), top 5%, based on code quality & docs
Installation
Universal Install (Auto-Detect): Cursor IDE, Windsurf IDE, VS Code IDE

```bash
npx killer-skills add sunholo-data/ailang/Benchmark Manager
```

Agent Capability Analysis

The Benchmark Manager MCP Server by sunholo-data is an open-source community integration for Claude and other AI agents, enabling seamless task automation and capability expansion.

Ideal Agent Persona

Ideal for AI Agents like Claude, AutoGPT, and LangChain that require efficient AILANG evaluation benchmark management and debugging

Core Value

Empowers agents to manage AILANG evaluation benchmarks with correct prompt integration, debugging workflows, and best practices learned from real benchmark failures, utilizing tools like ailang eval-suite and JSON parsing

Capabilities Granted for Benchmark Manager MCP Server

Debugging failing benchmarks with detailed prompt analysis
Testing benchmarks with specific AI models like claude-haiku-4-5
Validating benchmark YAML for common issues

Prerequisites & Limits

  • Requires AILANG evaluation benchmarks setup
  • Limited to debugging workflows and prompt integration
Project

  • SKILL.md (7.4 KB)
  • .cursorrules (1.2 KB)
  • package.json (240 B)

SKILL.md

Benchmark Manager

Manage AILANG evaluation benchmarks with correct prompt integration, debugging workflows, and best practices learned from real benchmark failures.

Quick Start

Debugging a failing benchmark:

```bash
# 1. Show the full prompt that models see
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh json_parse

# 2. Test a benchmark with a specific model
ailang eval-suite --models claude-haiku-4-5 --benchmarks json_parse

# 3. Check benchmark YAML for common issues
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/json_parse.yml
```

When to Use This Skill

Invoke this skill when:

  • User asks to create a new benchmark
  • User asks to debug/fix a failing benchmark
  • User wants to understand why models generate wrong code
  • User asks about benchmark YAML format
  • Benchmarks show 0% pass rate despite language support

CRITICAL: prompt vs task_prompt

This is the most important concept for benchmark management.

The Problem (v0.4.8 Discovery)

Benchmarks have TWO different prompt fields with VERY different behavior:

| Field | Behavior | Use When |
|---|---|---|
| `prompt:` | REPLACES the teaching prompt entirely | Testing raw model capability (rare) |
| `task_prompt:` | APPENDS to the teaching prompt | Normal benchmarks (99% of cases) |

Why This Matters

```yaml
# BAD - Model never sees AILANG syntax!
prompt: |
  Write a program that prints "Hello"

# GOOD - Model sees teaching prompt + task
task_prompt: |
  Write a program that prints "Hello"
```

With prompt:, models generate Python/pseudo-code because they never learn AILANG syntax.

How Prompts Combine

From internal/eval_harness/spec.go (lines 91-93):

```go
fullPrompt := basePrompt // Teaching prompt from prompts/v0.4.x.md
if s.TaskPrompt != "" {
    fullPrompt = fullPrompt + "\n\n## Task\n\n" + s.TaskPrompt
}
```

The teaching prompt teaches AILANG syntax; task_prompt adds the specific task.

Available Scripts

scripts/show_full_prompt.sh

Shows the complete prompt that models receive for a benchmark.

Usage:

```bash
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh <benchmark_id>

# Example:
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh json_parse
```

scripts/check_benchmark.sh

Validates a benchmark YAML file for common issues.

Usage:

```bash
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/<name>.yml
```

Checks for:

  • Using prompt: instead of task_prompt: (warning)
  • Missing required fields
  • Invalid capability names
  • Syntax errors in YAML
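As a rough picture of what these checks involve, here is a minimal grep-based sketch. This is a hypothetical re-implementation for illustration only; the real check_benchmark.sh may work quite differently. The sample file `/tmp/demo_bench.yml` is invented for the demo.

```shell
# Hypothetical re-implementation of the checks above, for illustration;
# the real check_benchmark.sh may differ. Build a deliberately broken sample:
cat > /tmp/demo_bench.yml <<'EOF'
prompt: |
  Write a program that prints "Hello"
caps: ["IO", "GPU"]
EOF
f=/tmp/demo_bench.yml

# Warn when prompt: is used (it replaces the teaching prompt entirely)
grep -q '^prompt:' "$f" && echo "WARN: uses prompt:, change to task_prompt:"

# Flag missing required fields
for field in id description task_prompt expected_stdout; do
  grep -q "^${field}:" "$f" || echo "ERROR: missing field: ${field}"
done

# Flag capability names outside the documented set (IO, FS, Clock, Net)
for c in $(sed -n 's/^caps: *\[\(.*\)\]/\1/p' "$f" | tr -d '",'); do
  case "$c" in IO|FS|Clock|Net) ;; *) echo "ERROR: invalid capability: $c" ;; esac
done
```

Run against the sample file, this prints the `prompt:` warning, one error per missing required field, and an error for the unknown `GPU` capability.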

scripts/test_benchmark.sh

Runs a quick single-model test of a benchmark.

Usage:

```bash
.claude/skills/benchmark-manager/scripts/test_benchmark.sh <benchmark_id> [model]

# Examples:
.claude/skills/benchmark-manager/scripts/test_benchmark.sh json_parse
.claude/skills/benchmark-manager/scripts/test_benchmark.sh json_parse claude-haiku-4-5
```

Benchmark YAML Format

Required Fields

```yaml
id: my_benchmark              # Unique identifier (snake_case)
description: "Short description of what this tests"
languages: ["python", "ailang"]
entrypoint: "main"            # Function to call
caps: ["IO"]                  # Required capabilities
difficulty: "easy|medium|hard"
expected_gain: "low|medium|high"
task_prompt: |                # ALWAYS use task_prompt, not prompt!
  Write a program in <LANG> that:
  1. Does something
  2. Prints the result

  Output only the code, no explanations.
expected_stdout: |            # Exact expected output
  expected output here
```

Capability Names

Valid capabilities: IO, FS, Clock, Net

```yaml
# File I/O
caps: ["IO"]

# HTTP requests
caps: ["Net", "IO"]

# File system operations
caps: ["FS", "IO"]
```

Creating New Benchmarks

Step 1: Determine Requirements

  • What language feature/capability is being tested?
  • Can models solve this with just the teaching prompt?
  • What's the expected output?

Step 2: Write the Benchmark

```yaml
id: my_new_benchmark
description: "Test feature X capability"
languages: ["python", "ailang"]
entrypoint: "main"
caps: ["IO"]
difficulty: "medium"
expected_gain: "medium"
task_prompt: |
  Write a program in <LANG> that:
  1. Clear description of task
  2. Another step
  3. Print the result

  Output only the code, no explanations.
expected_stdout: |
  exact expected output
```

Step 3: Validate and Test

```bash
# Check for issues
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/my_new_benchmark.yml

# Test with a cheap model first
ailang eval-suite --models claude-haiku-4-5 --benchmarks my_new_benchmark
```

Debugging Failing Benchmarks

Symptom: 0% Pass Rate Despite Language Support

Check 1: Is it using task_prompt:?

```bash
grep -E "^prompt:" benchmarks/failing_benchmark.yml
# If this returns a match, change to task_prompt:
```

Check 2: What prompt do models see?

```bash
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh failing_benchmark
```

Check 3: Is the teaching prompt up to date?

```bash
# After editing prompts/v0.x.x.md, you MUST rebuild:
make quick-install
```

Symptom: Models Copy Template Instead of Solving Task

The teaching prompt includes a template structure. If models copy it verbatim:

  1. Make sure the task is clearly different from the examples in the teaching prompt
  2. Check that task_prompt explicitly describes what to do
  3. Consider if the task description is ambiguous

Symptom: compile_error on Valid Syntax

Common AILANG-specific issues models get wrong:

| Wrong | Correct | Notes |
|---|---|---|
| `print(42)` | `print(show(42))` | `print` expects a string |
| `a % b` | `mod_Int(a, b)` | No `%` operator |
| `def main()` | `export func main()` | Wrong keyword |
| `for x in xs` | `match xs { ... }` | No `for` loops |

If models consistently make these mistakes, the teaching prompt needs improvement (use prompt-manager skill).
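When triaging a batch of compile errors, a quick grep over the generated code can surface these patterns before reading every failure. A minimal sketch of such a helper follows; it is hypothetical, with patterns assembled from the table above, and the sample file `/tmp/generated.ail` is invented for the demo.

```shell
# Hypothetical triage helper: flag the common AILANG mistakes listed above
# in a generated file (sketch only; patterns come from the table).
cat > /tmp/generated.ail <<'EOF'
def main()
  print(42)
EOF
f=/tmp/generated.ail

grep -n 'def '        "$f" && echo "HINT: AILANG uses 'export func', not 'def'"
grep -n 'print([0-9]' "$f" && echo "HINT: wrap non-strings in show(), e.g. print(show(42))"
grep -n ' % '         "$f" && echo "HINT: use mod_Int(a, b); AILANG has no % operator"
grep -n 'for .* in '  "$f" && echo "HINT: use match; AILANG has no for loops"
true  # exit cleanly even when the last pattern does not match
```

On the sample file this flags the `def` keyword and the unwrapped `print(42)`, which are exactly the fixes the table prescribes.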

Common Mistakes

1. Using prompt: Instead of task_prompt:

```yaml
# WRONG - Models never see AILANG syntax
prompt: |
  Write code that...

# CORRECT - Teaching prompt + task
task_prompt: |
  Write code that...
```

2. Forgetting to Rebuild After Prompt Changes

```bash
# After editing prompts/v0.x.x.md:
make quick-install  # REQUIRED!
```

3. Putting Hints in Benchmarks

```yaml
# WRONG - Hints in benchmark
task_prompt: |
  Write code that prints 42.
  Hint: Use print(show(42)) in AILANG.

# CORRECT - No hints; if models fail, fix the teaching prompt
task_prompt: |
  Write code that prints 42.
```

If models need AILANG-specific hints, the teaching prompt is incomplete. Use the prompt-manager skill to fix it.

4. Testing Too Many Models at Once

```bash
# WRONG - Expensive and slow for debugging
ailang eval-suite --full --benchmarks my_test

# CORRECT - Use one cheap model first
ailang eval-suite --models claude-haiku-4-5 --benchmarks my_test
```

Resources

Reference Guide

See resources/reference.md for:

  • Complete list of valid benchmark fields
  • Capability reference
  • Example benchmarks

Related Skills

  • prompt-manager: When benchmark failures indicate teaching prompt issues
  • eval-analyzer: For analyzing results across many benchmarks
  • use-ailang: For writing correct AILANG code

Notes

  • Benchmarks live in benchmarks/ directory
  • Eval results go to eval_results/ directory
  • Teaching prompt is embedded in binary - rebuild after changes
  • Use <LANG> placeholder in task_prompt - it's replaced with "AILANG" or "Python"
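The last note's <LANG> substitution can be pictured as simple string replacement, as in this illustrative sketch (the eval harness performs the substitution internally; this is not its actual code):

```shell
# Render one prompt per target language by substituting <LANG>
# (illustrative only; the eval harness does this internally).
task_prompt='Write a program in <LANG> that prints 42.'
for lang in AILANG Python; do
  printf '%s\n' "$task_prompt" | sed "s/<LANG>/$lang/"
done
# prints:
#   Write a program in AILANG that prints 42.
#   Write a program in Python that prints 42.
```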
